Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make changes to try to prevent memory bugs in parcel resolver #9885

Merged
merged 33 commits into from
Aug 1, 2024

Conversation

yamadapc
Copy link
Contributor

@yamadapc yamadapc commented Aug 1, 2024

This PR adds a copy of the resolver package that moves towards owned data structures rather than using arenas and borrowing.

The crate is copied so that it can be feature-flagged.

Benchmarks have been added which demonstrate the performance profile of the resolver, which is mostly IO bound, and demonstrating that it performs only marginally slower than it previously did avoiding copying/allocation (single digit % diff).

Preliminary end-to-end performance testing on a large application shows no difference to overall build time at this point.

The feature-flag is in place to validate the impact on other metrics, including the frequency of crashes we are experiencing.

@yamadapc yamadapc changed the title WIP - Make changes to try to prevent memory bugs in parcel resolver Make changes to try to prevent memory bugs in parcel resolver Aug 1, 2024
@yamadapc yamadapc requested a review from a team August 1, 2024 03:55
Copy link
Contributor

@mattcompiles mattcompiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll definitely want to clean this up ASAP once we've tested it.

@devongovett
Copy link
Member

Can you use cargo flags for this instead of forking the whole crate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Criterion benchmark results HTML file

These results were taken on a EC2 c7i.8xlarge instance. These vary quite a bit as the test-case is very fast. We should set-up more complex test cases with more files per iteration, to also be benchmarking the caches and so on.

The best nºs we'll get will be from production, when the flag is rolled-out.

Benchmarking FileSystem - check for non existent file using stat (exists): Collecting 100 samples in estimated 5.0021 s (9.3M iteraFileSystem - check for non existent file using stat (exists)
                        time:   [531.77 ns 533.04 ns 534.47 ns]
                        change: [-1.1872% -0.9784% -0.7506%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  7 (7.00%) high severe

Benchmarking FileSystem - check for non existent file open (read_to_string): Collecting 100 samples in estimated 5.0074 s (3.1M iteFileSystem - check for non existent file open (read_to_string)
                        time:   [1.5756 µs 1.5821 µs 1.5890 µs]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

Run safe resolver simple OsFileSystem
                        time:   [53.090 µs 53.399 µs 53.688 µs]
                        change: [-1.2921% -0.8435% -0.3762%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  5 (5.00%) high mild
  4 (4.00%) high severe

Run safe resolver modules OsFileSystem
                        time:   [78.842 µs 80.227 µs 81.887 µs]
                        change: [-2.4456% -1.4033% -0.2646%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  14 (14.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

Run unsafe resolver simple OsFileSystem
                        time:   [48.232 µs 48.455 µs 48.658 µs]
                        change: [-1.9564% -1.4375% -0.9269%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high severe

Run unsafe resolver modules OsFileSystem
                        time:   [73.793 µs 74.305 µs 74.923 µs]
                        change: [+2.6275% +3.9922% +5.5641%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking Run safe resolver simple - PreloadingFileSystem - No IO: Collecting 100 samples in estimated 5.0030 s (126k iterationsRun safe resolver simple - PreloadingFileSystem - No IO
                        time:   [23.365 µs 23.547 µs 23.760 µs]
                        change: [-0.8148% -0.0801% +0.6690%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

Benchmarking Run safe resolver modules - PreloadingFileSystem - No IO: Collecting 100 samples in estimated 5.0041 s (106k iterationRun safe resolver modules - PreloadingFileSystem - No IO
                        time:   [29.443 µs 29.469 µs 29.498 µs]
                        change: [-2.1236% -1.6713% -1.2740%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking Run unsafe resolver simple - PreloadingFileSystem - No IO: Collecting 100 samples in estimated 5.1314 s (146k iteratioRun unsafe resolver simple - PreloadingFileSystem - No IO
                        time:   [18.784 µs 18.800 µs 18.820 µs]
                        change: [-0.2920% -0.1053% +0.0995%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking Run unsafe resolver modules - PreloadingFileSystem - No IO: Collecting 100 samples in estimated 5.0471 s (121k iteratiRun unsafe resolver modules - PreloadingFileSystem - No IO
                        time:   [25.081 µs 25.370 µs 25.792 µs]
                        change: [+0.6886% +1.4525% +2.2694%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) high mild
  8 (8.00%) high severe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are Apple M1 Pro results:

FileSystem - check for non existent file using stat (exists)
                        time:   [674.93 ns 689.44 ns 712.65 ns]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

FileSystem - check for non existent file open (read_to_string)
                        time:   [1.2387 µs 1.2618 µs 1.2946 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Run safe resolver simple OsFileSystem
                        time:   [161.04 µs 163.25 µs 166.15 µs]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

Run safe resolver modules OsFileSystem
                        time:   [244.64 µs 249.86 µs 256.41 µs]
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

Run unsafe resolver simple OsFileSystem
                        time:   [158.88 µs 161.76 µs 164.91 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

Run unsafe resolver modules OsFileSystem
                        time:   [233.01 µs 237.07 µs 243.17 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  9 (9.00%) high severe

Run safe resolver simple - PreloadingFileSystem - No IO
                        time:   [18.144 µs 18.418 µs 18.842 µs]
Found 17 outliers among 100 measurements (17.00%)
  6 (6.00%) high mild
  11 (11.00%) high severe

Run safe resolver modules - PreloadingFileSystem - No IO
                        time:   [24.043 µs 24.276 µs 24.489 µs]
Found 19 outliers among 100 measurements (19.00%)
  8 (8.00%) high mild
  11 (11.00%) high severe

Run unsafe resolver simple - PreloadingFileSystem - No IO
                        time:   [13.251 µs 13.273 µs 13.298 µs]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

Run unsafe resolver modules - PreloadingFileSystem - No IO
                        time:   [18.525 µs 18.591 µs 18.672 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

pub invalidate_on_file_create: DashSet<FileCreateInvalidation>,
pub invalidate_on_file_change: DashSet<PathBuf>,
pub invalidate_on_file_create: RwLock<HashSet<FileCreateInvalidation>>,
pub invalidate_on_file_change: RwLock<HashSet<PathBuf>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DashSet should be fine. This seems unrelated to the cache change. Putting the entire hashset behind a lock will be much slower.

Copy link
Contributor Author

@yamadapc yamadapc Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @devongovett, from what I've measured, this isn't slower ; we won't accept any significant regression.

I'm keen to remove a few of the bits here if possible. DashSet should be fine if we trust it's implementation. But we need to simplify a few parts to improve reliability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes you think this is causing issues? Seems unrelated the potential memory safety problems you mentioned.

Dashmap is a super popular crate. It just shards the map into multiple disjoint maps under individual locks that can be accessed concurrently. I've previously measured improvements on benchmarks by using it. The resolver is quite hot and I did a lot of performance tuning on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to reduce the surface area of what can break.

As a follow-up or in this diff, I can include more benchmarks that try to replicate the performance gains that you've found. We can commit them onto the criterion suite and I can run them on both macOS and Linux/EC2.

Hash(String),
Package(String, String),
Builtin(String),
Url(String),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems quite unrelated from the cache changes.

Copy link
Contributor Author

@yamadapc yamadapc Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is trying to simplify the code.

I understand Cow is here as a performance optimisation, but it doesn't seem like this makes a difference to performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems unrelated to the safety issues though. Small things like not copying specifiers do add up in my experience. Perhaps it doesn't appear to make a difference right now because the rest of your build is a larger portion of the overall time than the resolver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you know +/- what types of benchmarks you've ran? I would be keen to add some benchmarks we can use to measure this.

Note that on the benchmarks I've posted above, there is between 1-10% difference of the overall time. However, since this is quite fast as is I don't that will translate to a meaningful difference in build time. I'm quite confident we can optimise this much more by optimising IO and JSON parsing.

I've added a few notes on the benchmark suite I added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd need to run them again, it was during development several years ago. Just seems strange to remove optimizations that we already implemented that aren't causing any particular problem.

Copy link
Contributor Author

@yamadapc yamadapc Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our intention indeed isn't to remove any optimisations. We are only trying to measure things. I can add DashMap back quite easily ; but I wanted to measure what allocation will do here.

I was asking what you've benchmarked so we can understand and reproduce your findings. Regardless of whether you run them again, we would want to know what types of benchmarks you're running and what the results are.

Copy link
Member

@devongovett devongovett Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually end-to-end benchmarks like esbuild's one or https://github.com/web-infra-dev/bundler-benchmark

@yamadapc
Copy link
Contributor Author

yamadapc commented Aug 1, 2024

Can you use cargo flags for this instead of forking the whole crate?

That'd be better, but if I do that can I switch between the two variants at runtime?

I think it won't make a big difference as the intention is just to do a safer/measured roll-out, then either revert or roll-forward shortly after.

@devongovett
Copy link
Member

can I switch between the two variants at runtime?

Not runtime but compile time, eg in canary or a dev build. Otherwise the binary size will increase due to having two resolver implementations as well.

@yamadapc
Copy link
Contributor Author

yamadapc commented Aug 1, 2024

Not runtime but compile time, eg in canary or a dev build. Otherwise the binary size will increase due to having two resolver implementations as well.

@devongovett The reason the crate is copied is that I need to be able to switch at runtime, so that we can roll this out internally measuring the outcomes.

Once we do that, we will have solid nºs on what impact this change has.

@devongovett
Copy link
Member

Why does it need to be runtime rather than just releasing a different dev/canary version and testing with that?

@yamadapc
Copy link
Contributor Author

yamadapc commented Aug 1, 2024

Switching at runtime will help us:

  • run the benchmarks that are already attached to this diff, which run new/old side by side
  • do a feature rollout safely internally collecting metrics
  • run other benchmarks over our build, which I'll share results for
  • makes the change simpler as I just had to copy directories

@yamadapc
Copy link
Contributor Author

yamadapc commented Aug 1, 2024

This is a critical problem for us at the moment, so I'm merging and iterating.

@yamadapc yamadapc merged commit 6f80bf1 into v2 Aug 1, 2024
17 checks passed
@yamadapc yamadapc deleted the pyamada/safe-resolver-final branch August 1, 2024 23:05
@devongovett
Copy link
Member

I think we may need to cut a separate branch for your team to work from so that v2 is always in a stable releasable/auditable state (or as much as possible). It's fine to do experiments or hotfixes in your own projects or in canary/dev builds, but v2 is getting messy and it is blocking stable releases. Seems like you don't really use stable releases anyway given that we haven't done one since February, so if you just had a branch that canary releases were published from and we merged stuff into v2 once it is stable that might be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants